Processing

Packages and external functions

Some qualitative variables have very sparsely represented levels (relative frequency below 5%). Merging some of these levels is necessary to address this problem: it would be worrying to let the algorithms learn from a handful of observations how a level interacts with its environment (other observations, other levels, other variables, the target).

Two strategies are used here to merge levels:

  • Project the levels onto the first factorial plane of an AFDM and group together the levels projected in the same direction (clusters).
  • Merge levels according to their positive-target rate: levels that behave similarly with respect to the target are grouped together when needed.
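The second strategy can be sketched in a few lines of base R. The data, the number of bins, and the grouping rule below are purely illustrative assumptions, not the project's actual code:

```r
# Illustrative sketch of strategy 2 (toy data): levels whose
# positive-target rates fall in the same bin are merged into one level.
set.seed(42)
df <- data.frame(
  cat    = sample(letters[1:6], 1000, replace = TRUE),
  target = rbinom(1000, 1, prob = 0.1)
)

rates   <- tapply(df$target, df$cat, mean)  # positive-target rate per level
bins    <- cut(rates, breaks = 3, labels = paste0("grp", 1:3))
mapping <- setNames(as.character(bins), names(rates))

df$cat_merged <- factor(mapping[as.character(df$cat)])
nlevels(df$cat_merged)  # at most 3 merged levels remain
```

In practice the bin boundaries would be chosen by inspecting the rates (and the levels' frequencies), rather than with equal-width cuts.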

Merging levels

AFDM

Logistic regression on a 104-level variable

Warning: Column `ps_car_11_cat` joining factors with different levels,
coercing to character vector

After merging the levels with these two strategies, we can train a LightGBM model to measure the impact of each treatment on the model's predictive power.

Benchmark - LGBM (on AFDM data)

set.seed(1234)
params_lgb = list(
  objective        = "binary", # type of exercise
  metric           = "auc",    # metric to be evaluated
  learning_rate    = 0.01,     # shrinkage rate
  max_depth        = 10,       # max depth for tree model (used to deal with over-fitting when data is small)
  num_leaves       = 20,       # max number of leaves (nodes) in one tree
  is_unbalance     = TRUE,     # whether the data is unbalanced
  min_data_in_leaf = 1,        # min number of data in one leaf (used to deal with over-fitting)
  feature_fraction = 0.8,      # randomly select part of the features on each iteration
  bagging_fraction = 0.8,      # randomly select part of the data without resampling
  bagging_freq     = 5,        # if != 0, enables bagging, performs bagging at every k iteration
  num_threads      = 6         # number of cpu cores (not threads) to use
)

post_fusion_afdm_lgb_cv = lgb.cv(
  params                = params_lgb,         # hyperparameters
  data                  = train_XY_lgb,       # lgb.Dataset object for training
  eval                  = lgb.normalizedgini, # custom metric, in addition to the first metric
  nrounds               = 1000,               # maximum iterations
  early_stopping_rounds = 50,                 # stop when the metric stops improving for 50 rounds
  verbose               = 1,                  # enable verbose
  eval_freq             = 50,                 # verbose every n iterations
  nfold                 = 5                   # k-folds CV
)

post_fusion_afdm_lgb_model <- lgb.train(
  params    = params_lgb,                        # hyperparameters
  data      = train_XY_lgb,                      # lgb.Dataset object for training
  valids    = list(train = train_XY_lgb),        # lgb.Dataset object for validation
  eval      = lgb.normalizedgini,                # custom metric, in addition to the first metric
  nrounds   = post_fusion_afdm_lgb_cv$best_iter, # nrounds from CV
  verbose   = 1,                                 # enable verbose
  eval_freq = 50                                 # verbose every n iterations
)
       Normalized Gini Coeff. (Train) 
                            0.3491641 
Normalized Gini Coeff. (Valid - 5fCV) 
                            0.2728841 
        Normalized Gini Coeff. (Test) 
                            0.2827741 
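The custom `eval` metric `lgb.normalizedgini` passed to `lgb.cv` and `lgb.train` is not shown in this document. A minimal sketch of a normalized Gini, in the `list(name, value, higher_better)` shape LightGBM's R API expects for custom eval functions, could look like the following; the project's actual implementation may differ, and `lightgbm::get_field` assumes a recent version of the package:

```r
# Unnormalized Gini: area between the cumulative-gain curve
# (observations sorted by predicted score) and the diagonal.
gini_raw <- function(actual, pred) {
  n   <- length(actual)
  ord <- order(pred, decreasing = TRUE)
  cum <- cumsum(actual[ord]) / sum(actual)
  sum(cum - seq_len(n) / n) / n
}

# Normalized Gini: raw Gini divided by the Gini of a perfect ranking,
# so a perfect model scores 1 and a random one about 0.
normalized_gini <- function(actual, pred) {
  gini_raw(actual, pred) / gini_raw(actual, actual)
}

# Wrapper in the form lgb.cv()/lgb.train() expect for `eval`.
lgb.normalizedgini <- function(preds, dtrain) {
  actual <- lightgbm::get_field(dtrain, "label")
  list(name          = "gini",
       value         = normalized_gini(actual, preds),
       higher_better = TRUE)
}
```

For a binary target, the normalized Gini is equivalent to 2 * AUC - 1, which is why "auc" is kept as the first metric alongside it.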

Benchmark - LGBM (on TABC data)

set.seed(1234)
params_lgb = list(
  objective        = "binary", # type of exercise
  metric           = "auc",    # metric to be evaluated
  learning_rate    = 0.01,     # shrinkage rate
  max_depth        = 10,       # max depth for tree model (used to deal with over-fitting when data is small)
  num_leaves       = 20,       # max number of leaves (nodes) in one tree
  is_unbalance     = TRUE,     # whether the data is unbalanced
  min_data_in_leaf = 1,        # min number of data in one leaf (used to deal with over-fitting)
  feature_fraction = 0.8,      # randomly select part of the features on each iteration
  bagging_fraction = 0.8,      # randomly select part of the data without resampling
  bagging_freq     = 5,        # if != 0, enables bagging, performs bagging at every k iteration
  num_threads      = 6         # number of cpu cores (not threads) to use
)

post_fusion_tabc_lgb_cv = lgb.cv(
  params                = params_lgb,         # hyperparameters
  data                  = train_XY_lgb,       # lgb.Dataset object for training
  eval                  = lgb.normalizedgini, # custom metric, in addition to the first metric
  nrounds               = 1000,               # maximum iterations
  early_stopping_rounds = 50,                 # stop when the metric stops improving for 50 rounds
  verbose               = 1,                  # enable verbose
  eval_freq             = 50,                 # verbose every n iterations
  nfold                 = 5                   # k-folds CV
)

post_fusion_tabc_lgb_model <- lgb.train(
  params    = params_lgb,                        # hyperparameters
  data      = train_XY_lgb,                      # lgb.Dataset object for training
  valids    = list(train = train_XY_lgb),        # lgb.Dataset object for validation
  eval      = lgb.normalizedgini,                # custom metric, in addition to the first metric
  nrounds   = post_fusion_tabc_lgb_cv$best_iter, # nrounds from CV
  verbose   = 1,                                 # enable verbose
  eval_freq = 50                                 # verbose every n iterations
)
       Normalized Gini Coeff. (Train) 
                            0.3531964 
Normalized Gini Coeff. (Valid - 5fCV) 
                            0.2756634 
        Normalized Gini Coeff. (Test) 
                            0.2847164 

The AFDM-based merges sharply reduced the Gini coefficient. We therefore keep only the contingency-table merges, even though they slightly lower the score: these merges are necessary, and their cost (the loss in score) remains reasonable.

Export

After transforming the two samples, we can split them again using the "dataset" column. Two datasets will therefore be exported (remember to record the column types and export them as well).
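A minimal sketch of that re-split and type bookkeeping follows; the file names and the `saveRDS` choice are assumptions, not the project's actual export code:

```r
# Re-split a combined table on its "dataset" column and keep the column
# classes so types can be restored when the files are read back.
full <- data.frame(dataset = rep(c("train", "test"), each = 3),
                   x = 1:6,
                   y = factor(letters[1:6]))

keep  <- setdiff(names(full), "dataset")
train <- full[full$dataset == "train", keep]
test  <- full[full$dataset == "test",  keep]

col_types <- vapply(train, function(col) class(col)[1], character(1))
# write.csv(train, "train_post_fusion.csv", row.names = FALSE)  # assumed names
# write.csv(test,  "test_post_fusion.csv",  row.names = FALSE)
# saveRDS(col_types, "col_types.rds")
```

Storing `col_types` matters because CSV round-trips lose factor levels: on re-import, the saved classes can be reapplied column by column.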

Other transformations are also considered; they are not mandatory, but they are recommended. However, any transformation that lowers the score will not be kept, in which case we will continue with the current datasets.

Transforming continuous variables

 [1] "id"         "ps_car_13"  "ps_reg_03"  "ps_ind_03"  "ps_reg_01" 
 [6] "ps_ind_15"  "ps_car_14"  "ps_car_15"  "ps_ind_01"  "ps_reg_02" 
[11] "ps_car_12"  "ps_calc_14" "ps_calc_10" "ps_calc_03" "ps_calc_02"
[16] "ps_calc_11" "ps_car_11" 
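The transformation tested below amounts to normalizing a handful of these continuous variables. A minimal sketch on synthetic data; which of the columns above actually get scaled is an assumption not shown here:

```r
# Center and scale a chosen subset of continuous columns (synthetic data;
# the 4 variables normalized in the project are not identified here).
set.seed(1)
df <- data.frame(ps_car_13 = runif(50, 0.5, 2),
                 ps_reg_03 = runif(50, 0, 1))

to_scale <- c("ps_car_13", "ps_reg_03")
df[to_scale] <- lapply(df[to_scale], function(x) as.numeric(scale(x)))

sapply(df[to_scale], mean)  # means are ~0 after centering
```

For tree-based models such as LightGBM this kind of monotone rescaling should be close to neutral, which is consistent with the small score changes observed below.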

Benchmark - LGBM (on AFDM data)

set.seed(1234)
params_lgb = list(
  objective        = "binary", # type of exercise
  metric           = "auc",    # metric to be evaluated
  learning_rate    = 0.01,     # shrinkage rate
  max_depth        = 10,       # max depth for tree model (used to deal with over-fitting when data is small)
  num_leaves       = 20,       # max number of leaves (nodes) in one tree
  is_unbalance     = TRUE,     # whether the data is unbalanced
  min_data_in_leaf = 1,        # min number of data in one leaf (used to deal with over-fitting)
  feature_fraction = 0.8,      # randomly select part of the features on each iteration
  bagging_fraction = 0.8,      # randomly select part of the data without resampling
  bagging_freq     = 5,        # if != 0, enables bagging, performs bagging at every k iteration
  num_threads      = 6         # number of cpu cores (not threads) to use
)

post_traite_afdm_lgb_cv = lgb.cv(
  params                = params_lgb,         # hyperparameters
  data                  = train_XY_lgb,       # lgb.Dataset object for training
  eval                  = lgb.normalizedgini, # custom metric, in addition to the first metric
  nrounds               = 1000,               # maximum iterations
  early_stopping_rounds = 50,                 # stop when the metric stops improving for 50 rounds
  verbose               = 1,                  # enable verbose
  eval_freq             = 50,                 # verbose every n iterations
  nfold                 = 5                   # k-folds CV
)

post_traite_afdm_lgb_model <- lgb.train(
  params    = params_lgb,                        # hyperparameters
  data      = train_XY_lgb,                      # lgb.Dataset object for training
  valids    = list(train = train_XY_lgb),        # lgb.Dataset object for validation
  eval      = lgb.normalizedgini,                # custom metric, in addition to the first metric
  nrounds   = post_traite_afdm_lgb_cv$best_iter, # nrounds from CV
  verbose   = 1,                                 # enable verbose
  eval_freq = 50                                 # verbose every n iterations
)
       Normalized Gini Coeff. (Train) 
                            0.3489638 
Normalized Gini Coeff. (Valid - 5fCV) 
                            0.2728511 
        Normalized Gini Coeff. (Test) 
                            0.2826310 

Benchmark - LGBM (on TABC data)

set.seed(1234)
params_lgb = list(
  objective        = "binary", # type of exercise
  metric           = "auc",    # metric to be evaluated
  learning_rate    = 0.01,     # shrinkage rate
  max_depth        = 10,       # max depth for tree model (used to deal with over-fitting when data is small)
  num_leaves       = 20,       # max number of leaves (nodes) in one tree
  is_unbalance     = TRUE,     # whether the data is unbalanced
  min_data_in_leaf = 1,        # min number of data in one leaf (used to deal with over-fitting)
  feature_fraction = 0.8,      # randomly select part of the features on each iteration
  bagging_fraction = 0.8,      # randomly select part of the data without resampling
  bagging_freq     = 5,        # if != 0, enables bagging, performs bagging at every k iteration
  num_threads      = 6         # number of cpu cores (not threads) to use
)

post_traite_tabc_lgb_cv = lgb.cv(
  params                = params_lgb,         # hyperparameters
  data                  = train_XY_lgb,       # lgb.Dataset object for training
  eval                  = lgb.normalizedgini, # custom metric, in addition to the first metric
  nrounds               = 1000,               # maximum iterations
  early_stopping_rounds = 50,                 # stop when the metric stops improving for 50 rounds
  verbose               = 1,                  # enable verbose
  eval_freq             = 50,                 # verbose every n iterations
  nfold                 = 5                   # k-folds CV
)

post_traite_tabc_lgb_model <- lgb.train(
  params    = params_lgb,                        # hyperparameters
  data      = train_XY_lgb,                      # lgb.Dataset object for training
  valids    = list(train = train_XY_lgb),        # lgb.Dataset object for validation
  eval      = lgb.normalizedgini,                # custom metric, in addition to the first metric
  nrounds   = post_traite_tabc_lgb_cv$best_iter, # nrounds from CV
  verbose   = 1,                                 # enable verbose
  eval_freq = 50                                 # verbose every n iterations
)
       Normalized Gini Coeff. (Train) 
                            0.3419099 
Normalized Gini Coeff. (Valid - 5fCV) 
                            0.2751949 
        Normalized Gini Coeff. (Test) 
                            0.2840637 

Scores decreased for both datasets, which is significant for such simple transformations (normalizing 4 continuous variables). We drop these transformations to stay in line with the goal of optimization. However, we keep the contingency-table merging of the qualitative variables' levels: those merges are necessary, even though they slightly lower the score.

Step 4